Semi-supervised training


UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech

Liu, Jiaxuan, Xiang, Yang, Zhao, Han, Li, Xiangang, Gao, Yingying, Zhang, Shilei, Ling, Zhenhua

arXiv.org Artificial Intelligence

Recent large language models (LLMs) have made great progress in the field of text-to-speech (TTS), but they still face major challenges in synthesizing fine-grained emotional speech in an interpretable manner. Traditional methods rely on discrete emotion labels to control emotion categories and intensities, which cannot capture the complexity and continuity of human emotional perception and expression. The lack of large-scale emotional speech datasets with balanced emotion distributions and fine-grained emotional annotations often causes overfitting in synthesis models and impedes effective emotion control. To address these issues, we propose UDDETTS, a universal LLM framework unifying discrete and dimensional emotions for controllable emotional TTS. The model introduces the interpretable Arousal-Dominance-Valence (ADV) space for dimensional emotion description and supports emotion control driven by either discrete emotion labels or nonlinearly quantified ADV values. Furthermore, a semi-supervised training strategy is designed to comprehensively utilize diverse speech datasets with different types of emotional annotations to train UDDETTS. Experiments show that UDDETTS achieves linear emotion control along three interpretable dimensions and exhibits superior end-to-end emotional speech synthesis capabilities. Code and demos are available at: https://anonymous.4open.science/ UDDETTS is designed for large-scale emotional speech datasets and integrates discrete label and dimensional ADV annotations to enable controllable emotional TTS. LLM-based TTS models leverage the strong language understanding of LLMs to generate speech semantic tokens from text tokens, thereby achieving significant advantages in synthesizing expressive speech; current LLM-based methods primarily rely on emotion prompts for supervised fine-tuning.
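The nonlinear quantization of ADV values can be sketched in a few lines. The tanh-based warp, bin count, and function names below are illustrative assumptions, not the paper's actual scheme:

```python
import math

def quantize_adv(value, num_bins=8):
    """Nonlinearly quantize one ADV coordinate in [0, 1] to a discrete token id.

    Hypothetical scheme: a tanh warp concentrates resolution around the
    neutral midpoint before uniform binning.
    """
    centered = 2.0 * value - 1.0                          # map to [-1, 1]
    warped = math.tanh(2.0 * centered) / math.tanh(2.0)   # still in [-1, 1]
    normalized = (warped + 1.0) / 2.0                     # back to [0, 1]
    return min(int(normalized * num_bins), num_bins - 1)

def adv_to_tokens(arousal, dominance, valence, num_bins=8):
    """Map an (A, D, V) triple to three discrete control tokens."""
    return tuple(quantize_adv(v, num_bins) for v in (arousal, dominance, valence))
```

A warp of this kind gives finer bins near the neutral midpoint, one plausible way to make dimensional control feel perceptually linear.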


Info-Coevolution: An Efficient Framework for Data Model Coevolution

Qin, Ziheng, Xu, Hailun, Yew, Wei Chee, Jia, Qi, Luo, Yang, Sarkar, Kanchan, Guan, Danhui, Wang, Kai, You, Yang

arXiv.org Artificial Intelligence

Machine learning relies heavily on data, yet the continuous growth of real-world data poses challenges for efficient dataset construction and training. A fundamental yet unsolved question is: given our current model and data, does a new data sample (or batch) need to be annotated and learned? Conventional approaches retain all available data, leading to suboptimal data and training efficiency. Active learning aims to reduce data redundancy by selecting a subset of samples to annotate, but it increases pipeline complexity and introduces bias. In this work, we propose Info-Coevolution, a novel framework that efficiently enables models and data to coevolve through online selective annotation without bias. Leveraging task-specific models (and open-source models), it selectively annotates and integrates online and web data to improve datasets efficiently. For real-world datasets such as ImageNet-1K, Info-Coevolution reduces annotation and training costs by 32% without performance loss, and it determines this saving ratio automatically rather than requiring it to be tuned. It can further reduce the annotation ratio to 50% with semi-supervised learning. We also explore retrieval-based dataset enhancement using unlabeled open-source data. Code is available at https://github.com/NUS-HPC-AI-Lab/Info-Coevolution/.
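The core annotate-or-skip decision can be sketched as a simple confidence test. The max-probability threshold below is a hypothetical stand-in for the framework's actual selection rule:

```python
def needs_annotation(probs, threshold=0.9):
    """Decide whether a new sample is worth human annotation.

    Illustrative criterion: if the current model's top predicted
    probability is already high, the sample adds little information and
    can be skipped (or pseudo-labelled); otherwise, send it to annotators.
    """
    return max(probs) < threshold

def select_for_annotation(batch_probs, threshold=0.9):
    """Return indices of samples in a batch that should be annotated."""
    return [i for i, p in enumerate(batch_probs) if needs_annotation(p, threshold)]
```

Applied online, a rule like this lets the annotation budget shrink automatically as the model improves, since confident samples are filtered out for free.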


A Practitioner's Guide to Building ASR Models for Low-Resource Languages: A Case Study on Scottish Gaelic

Klejch, Ondřej, Lamb, William, Bell, Peter

arXiv.org Artificial Intelligence

An effective approach to the development of ASR systems for low-resource languages is to fine-tune an existing multilingual end-to-end model. When the original model has been trained on large quantities of data from many languages, fine-tuning can be effective with limited training data, even when the language in question was not present in the original training data. The fine-tuning approach has been encouraged by the availability of public-domain E2E models and is widely believed to lead to state-of-the-art results. This paper, however, challenges that belief. We show that an approach combining hybrid HMMs with self-supervised models can yield substantially better performance with limited training data. This combination allows better utilisation of all available speech and text data through continued self-supervised pre-training and semi-supervised training. We benchmark our approach on Scottish Gaelic, achieving WER reductions of 32% relative over our best fine-tuned Whisper model.


Entriever: Energy-based Retriever for Knowledge-Grounded Dialog Systems

Cai, Yucheng, Li, Ke, Huang, Yi, Feng, Junlan, Ou, Zhijian

arXiv.org Artificial Intelligence

A retriever, which retrieves relevant knowledge pieces from a knowledge base given a context, is an important component in many natural language processing (NLP) tasks. Retrievers have been introduced in knowledge-grounded dialog systems to improve knowledge acquisition. In such systems, there may be multiple relevant and correlated knowledge pieces for a given context; however, current retriever models usually assume knowledge pieces to be conditionally independent. To address this issue, we propose Entriever, an energy-based retriever. Entriever directly models a candidate retrieval result as a whole, rather than modeling the knowledge pieces separately, with the relevance score defined by an energy function. We explore various architectures of energy functions and different training methods for Entriever, and show that it substantially outperforms a strong cross-encoder baseline on knowledge retrieval tasks. Furthermore, we show that in semi-supervised training of knowledge-grounded dialog systems, Entriever enables effective scoring of retrieved knowledge pieces and significantly improves the end-to-end performance of dialog systems.
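The idea of scoring a retrieval result as a whole, rather than piece by piece, can be illustrated with a toy energy function. The unary-plus-pairwise form and the values below are assumptions for illustration, not Entriever's learned architecture:

```python
def set_energy(unary, pairwise, subset):
    """Energy of a candidate retrieval result (a set of knowledge pieces).

    Toy form: per-piece relevance energies plus pairwise interaction
    energies, so correlated pieces are scored jointly rather than
    independently. Lower energy means a more relevant set.
    """
    e = sum(unary[i] for i in subset)
    for a in range(len(subset)):
        for b in range(a + 1, len(subset)):
            e += pairwise.get((subset[a], subset[b]), 0.0)
    return e

def best_subset(unary, pairwise, candidates):
    """Pick the candidate subset with the lowest energy."""
    return min(candidates, key=lambda s: set_energy(unary, pairwise, s))
```

With a negative interaction term between two correlated pieces, the pair can beat either piece alone, which an independence assumption cannot express.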


Confidence-Guided Data Augmentation for Improved Semi-Supervised Training

Khmaissia, Fadoua, Frigui, Hichem

arXiv.org Artificial Intelligence

We propose a new strategy to improve the accuracy and robustness of image classification. First, we train a baseline CNN model. Then, we identify challenging regions of the feature space by collecting all misclassified samples, together with correctly classified samples that have low confidence values. These samples are used to train a Variational AutoEncoder (VAE), which is then used to generate synthetic images. Finally, the generated synthetic images are combined with the original labeled images to train a new model in a semi-supervised fashion. Empirical results on benchmark datasets such as STL10 and CIFAR-100 show that the synthetically generated samples further diversify the training data, improving image classification compared with fully supervised baselines that use only the available data.
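The sample-selection step described above maps fairly directly to code; the confidence threshold value here is an illustrative choice, not one taken from the paper:

```python
def select_hard_samples(predictions, labels, confidences, conf_threshold=0.7):
    """Identify 'challenging' samples to train the VAE on.

    Following the described strategy: keep every misclassified sample,
    plus correctly classified samples whose confidence falls below a
    threshold (threshold value is illustrative).
    """
    hard = []
    for i, (pred, label, conf) in enumerate(zip(predictions, labels, confidences)):
        if pred != label or conf < conf_threshold:
            hard.append(i)
    return hard
```

The returned indices would then select the images used to fit the VAE before the synthetic-augmentation stage.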


Semi-supervised 3D Object Detection via Temporal Graph Neural Networks

Wang, Jianren, Gang, Haiming, Ancha, Siddarth, Chen, Yi-Ting, Held, David

arXiv.org Artificial Intelligence

3D object detection plays an important role in autonomous driving and other robotics applications. However, these detectors usually require training on large amounts of annotated data that is expensive and time-consuming to collect. Instead, we propose leveraging large amounts of unlabeled point cloud videos by semi-supervised learning of 3D object detectors via temporal graph neural networks. Our insight is that temporal smoothing can create more accurate detection results on unlabeled data, and these smoothed detections can then be used to retrain the detector. We learn to perform this temporal reasoning with a graph neural network, where edges represent the relationship between candidate detections in different time frames. After semi-supervised learning, our method achieves state-of-the-art detection performance on the challenging nuScenes and H3D benchmarks, compared to baselines trained on the same amount of labeled data. Project and code are released at https://www.jianrenw.com/SOD-TGNN/.
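The role of temporal smoothing as a source of pseudo-labels can be illustrated with a moving average over a track's per-frame detection scores. This is a hand-written stand-in for the learned graph neural network, with illustrative parameter values:

```python
def temporal_smooth(track_scores, window=3):
    """Smooth per-frame detection scores along one tracked object.

    A simple moving average standing in for the learned graph reasoning:
    confidence is shared across neighbouring frames, so a detection that
    flickers off for a single frame can still survive.
    """
    n = len(track_scores)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - window // 2), min(n, i + window // 2 + 1)
        smoothed.append(sum(track_scores[lo:hi]) / (hi - lo))
    return smoothed

def pseudo_labels(track_scores, threshold=0.5):
    """Indices of frames kept as pseudo-labels after smoothing."""
    return [i for i, s in enumerate(temporal_smooth(track_scores)) if s >= threshold]
```

In the example below, the middle frame's raw score (0.2) would be discarded on its own, but borrowing confidence from its neighbours keeps it, which is exactly the effect that makes the smoothed detections useful for retraining.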


Cost-effective speech-to-text with weakly- and semi-supervised training

AIHub

Voice assistants equipped with speech-to-text technology have seen a major boost in performance and usage, thanks to powerful new machine learning methods based on deep neural networks. These methods follow a supervised learning approach, requiring large amounts of paired speech-text data to train the best performing speech-to-text transcription models. After collecting large amounts of relevant and diverse spoken utterances, the complex and intensive task of annotating and labelling the collected speech data awaits. To get a feel for a typical scenario, let's look at some estimates. On average, a typical user query, for example "Do you have the Christmas edition with Santa?", would last about 3 seconds.


Autoencoder-based cleaning in probabilistic databases

Mauritz, R. R., Nijweide, F. P. J., Goseling, J., van Keulen, M.

arXiv.org Artificial Intelligence

In the field of data integration, data quality problems are often encountered when extracting, combining, and merging data. The probabilistic data integration approach represents information about such problems as uncertainties in a probabilistic database. In this paper, we propose a data-cleaning autoencoder capable of near-automatic data quality improvement. It learns the structure and dependencies in the data to identify and correct doubtful values. A theoretical framework is provided, and experiments show that the method can remove significant amounts of noise from categorical and numeric probabilistic data. Our method does not require clean data; we do, however, show that manually cleaning a small fraction of the data significantly improves performance. I. Introduction. Data quality problems are a major threat in data science. Specifically, in the field of data integration, i.e., combining several data sources into a single and unified view [1], uncertainties are often encountered when extracting, combining, and merging data. These uncertainties can result from the nature of the data (such as noise in measurements), but can also result from the integration process itself. Information about these uncertainties is considered an important result of the integration process [2]. In probabilistic data integration (PDI), the integration result as well as information on uncertainty is stored in a probabilistic database (PDB) [3]. The PDB maintains possible alternatives for values and records, their likelihoods, and the dependencies among them. The PDI process (see Figure 1) consists of two main phases: the initial integration phase is followed by the improvement phase, which gathers evidence while the data is being used, for the purpose of gradually improving its quality.
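One way such an autoencoder could be used for cleaning is to score each candidate alternative for a doubtful value by reconstruction error. This minimal sketch assumes numeric records and a hypothetical `autoencoder` callable; it is not the paper's implementation:

```python
def clean_value(autoencoder, record, doubtful_index, candidates):
    """Pick the most plausible alternative for one doubtful attribute.

    For each candidate value, substitute it into the record, reconstruct
    the record with the (hypothetical) trained autoencoder, and keep the
    candidate whose record reconstructs with the lowest squared error.
    """
    def recon_error(rec):
        out = autoencoder(rec)
        return sum((a - b) ** 2 for a, b in zip(rec, out))

    return min(
        candidates,
        key=lambda v: recon_error(
            record[:doubtful_index] + [v] + record[doubtful_index + 1:]),
    )
```

Because the autoencoder has learned the dependencies between attributes, a candidate that violates them reconstructs poorly and is rejected, which mirrors how the alternatives stored in a probabilistic database could be ranked.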


Twin Neural Network Regression is a Semi-Supervised Regression Algorithm

Wetzel, Sebastian J., Melko, Roger G., Tamblyn, Isaac

arXiv.org Artificial Intelligence

Twin neural network regression (TNNR) is a semi-supervised regression algorithm: it can be trained on unlabelled data points as long as other, labelled, anchor data points are present. TNNR is trained to predict differences between the target values of two data points rather than the targets themselves. By ensembling the predicted differences between an unseen data point and all training data points, it is possible to obtain a very accurate prediction for the original regression problem. Since any loop of predicted differences should sum to zero, loops can be supplied as training data even when the data points within a loop are unlabelled. Semi-supervised training significantly improves TNNR performance, which is already state of the art.
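The difference-ensembling prediction and the loop-consistency condition can both be written in a few lines; `diff_model` below is a hypothetical callable standing in for the trained twin network:

```python
def tnnr_predict(diff_model, x, anchors):
    """Predict a target by ensembling predicted differences to labelled anchors.

    `diff_model(a, b)` is assumed to estimate f(a) - f(b); the final
    prediction averages (anchor_label + predicted difference) over all
    labelled anchors.
    """
    estimates = [label + diff_model(x, ax) for ax, label in anchors]
    return sum(estimates) / len(estimates)

def loop_residual(diff_model, points):
    """Sum of predicted differences around a closed loop; ideally zero.

    This is the consistency condition that lets unlabelled points enter
    training: the loop constraint can be enforced without knowing any
    of the targets involved.
    """
    total = 0.0
    for i in range(len(points)):
        total += diff_model(points[i], points[(i + 1) % len(points)])
    return total
```

A perfectly consistent difference model makes every loop residual exactly zero; penalizing nonzero residuals on loops through unlabelled points is what makes the algorithm semi-supervised.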


Semi-supervised acoustic and language model training for English-isiZulu code-switched speech recognition

Biswas, A., de Wet, F., van der Westhuizen, E., Niesler, T. R.

arXiv.org Machine Learning

We present an analysis of semi-supervised acoustic and language model training for English-isiZulu code-switched ASR using soap opera speech. Approximately 11 hours of untranscribed multilingual speech was transcribed automatically using four bilingual code-switching transcription systems operating in English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho. These transcriptions were incorporated into the acoustic and language model training sets. Results showed that the TDNN-F acoustic models benefit from the additional semi-supervised data and that even better performance could be achieved by including additional CNN layers. Using these CNN-TDNN-F acoustic models, a first iteration of semi-supervised training achieved an absolute mixed-language WER reduction of 3.4%, and a further 2.2% after a second iteration. Although the languages in the untranscribed data were unknown, the best results were obtained when all automatically transcribed data was used for training and not just the utterances classified as English-isiZulu. Despite reducing perplexity, the semi-supervised language model was not able to improve the ASR performance.